
libvirt: Enable multiple PodVM image scenario #2061

Merged

Conversation

ajaypvictor
Contributor

Enable support for multiple PodVM images for libvirt provider with the merge of kata-containers/kata-containers#10274

Fixes #1282

Working Scenario:
Tried with the following busybox manifest:

apiVersion: v1
kind: Pod
metadata:
  name: busybox
  namespace: openshift-sandboxed-containers-operator
  labels:
    app: busybox
  annotations:
    io.katacontainers.config.hypervisor.image: "pppvm-nov-vol.qcow2"
spec:
  containers:
  - image: quay.io/prometheus/busybox
    imagePullPolicy: Always
    name: busybox-container
    ports:
    - containerPort: 80
      protocol: TCP
  restartPolicy: Always
  runtimeClassName: kata-remote

Here pppvm-nov-vol.qcow2 is a valid libvirt volume to which a podvm image has been uploaded.
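Roughly, the selection behaviour amounts to the following (a minimal sketch with hypothetical names, not the actual provider code): the annotation value, when present, overrides the provider's default volume.

```go
package main

import "fmt"

// Annotation key used in the manifest above.
const imageAnnotation = "io.katacontainers.config.hypervisor.image"

// chooseVolume (hypothetical name) picks the libvirt volume to boot the
// podvm from: the annotation value when set, otherwise the provider default.
func chooseVolume(annotations map[string]string, defaultVol string) string {
	if vol := annotations[imageAnnotation]; vol != "" {
		return vol
	}
	return defaultVol
}

func main() {
	anns := map[string]string{imageAnnotation: "pppvm-nov-vol.qcow2"}
	fmt.Println(chooseVolume(anns, "podvm-base.qcow2")) // annotation overrides the default
	fmt.Println(chooseVolume(nil, "podvm-base.qcow2"))  // no annotation: default volume
}
```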

CAA logs:

2024/09/26 09:20:17 [adaptor/cloud] initdata in Pod annotation:
2024/09/26 09:20:17 [adaptor/cloud] initdata in pod annotation is empty, use global initdata:
2024/09/26 09:20:17 [adaptor/cloud] create a sandbox 20a284a33e1e6222844abd4f92bda5ee7b897d752f2b3a505f4f32b5ef5eb33e for pod busybox in namespace openshift-sandboxed-containers-operator (netns: /var/run/netns/7745570f-3b24-4f0e-a623-e060ded3f3db)
2024/09/26 09:20:17 [adaptor/cloud/libvirt] LaunchSecurityType: None
2024/09/26 09:20:17 [adaptor/cloud/libvirt] Choosing pppvm-nov-vol.qcow2 as libvirt volume for the PodVM image
2024/09/26 09:20:17 [adaptor/cloud/libvirt] Checking if instance (podvm-busybox-20a284a3) exists
2024/09/26 09:20:17 [adaptor/cloud/libvirt] Uploaded volume key /var/lib/libvirt/images/kata-pod-vms/podvm-busybox-20a284a3-root.qcow2
2024/09/26 09:20:17 [adaptor/cloud/libvirt] Create cloudInit iso
2024/09/26 09:20:17 [adaptor/cloud/libvirt] Uploading iso file: podvm-busybox-20a284a3-cloudinit.iso
......
......
2024/09/26 09:21:28 [adaptor/proxy]         io.katacontainers.pkg.oci.container_type: pod_container
2024/09/26 09:21:28 [adaptor/proxy]         io.kubernetes.cri-o.Image: quay.io/prometheus/busybox@sha256:59d0ed3060aef57d1b23bc353a2223af24a6e1d035486647eb599a77ff2d446e
2024/09/26 09:21:28 [adaptor/proxy]         io.kubernetes.cri-o.Labels: {"io.kubernetes.container.name":"busybox-container","io.kubernetes.pod.name":"busybox","io.kubernetes.pod.namespace":"openshift-sandboxed-containers-operator","io.kubernetes.pod.uid":"ff0368c7-c9f1-4c2f-9c95-3a8a168c6ed8"}
2024/09/26 09:21:28 [adaptor/proxy]         io.katacontainers.config.hypervisor.image: pppvm-nov-vol.qcow2
2024/09/26 09:21:28 [adaptor/proxy]         io.kubernetes.container.restartCount: 0

Failure Scenario:
Tried with the following busybox manifest:

apiVersion: v1
kind: Pod
metadata:
  name: busybox
  namespace: openshift-sandboxed-containers-operator
  labels:
    app: busybox
  annotations:
    io.katacontainers.config.hypervisor.image: "rhel9-os"
spec:
  containers:
  - image: quay.io/prometheus/busybox
    imagePullPolicy: Always
    name: busybox-container
    ports:
    - containerPort: 80
      protocol: TCP
  restartPolicy: Always
  runtimeClassName: kata-remote

Here rhel9-os is a dummy value, not an actual libvirt volume.
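The failure below comes from the libvirt volume lookup failing for that name; in spirit (hypothetical helper, not the actual provider code):

```go
package main

import "fmt"

// lookupVolume (hypothetical) stands in for the libvirt storage-pool
// lookup: it fails when the requested volume name is not in the pool.
func lookupVolume(pool map[string]bool, name string) error {
	if !pool[name] {
		return fmt.Errorf("Error in creating volume: Can't retrieve volume %s", name)
	}
	return nil
}

func main() {
	pool := map[string]bool{"pppvm-nov-vol.qcow2": true}
	fmt.Println(lookupVolume(pool, "rhel9-os"))            // lookup fails, instance creation aborts
	fmt.Println(lookupVolume(pool, "pppvm-nov-vol.qcow2")) // nil error: volume exists
}
```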

CAA logs:

2024/09/26 09:29:09 [adaptor/cloud] create a sandbox 57bbf987d92ff21b508adddff69ef36a400f6ba7e9dfaabf13c1b54897aca0d3 for pod busybox in namespace openshift-sandboxed-containers-operator (netns: /var/run/netns/dd971e3c-d1e8-4156-8832-35ead351c3b0)
2024/09/26 09:29:09 [adaptor/cloud/libvirt] LaunchSecurityType: None
2024/09/26 09:29:09 [adaptor/cloud/libvirt] Choosing rhel9-os as libvirt volume for the PodVM image
2024/09/26 09:29:09 [adaptor/cloud/libvirt] Checking if instance (podvm-busybox-57bbf987) exists
2024/09/26 09:29:09 [adaptor/cloud/libvirt] failed to create an instance : Error in creating volume: Can't retrieve volume rhel9-os
2024/09/26 09:29:09 [adaptor/cloud] error starting instance: creating an instance : Error in creating volume: Can't retrieve volume rhel9-os
2024/09/26 09:29:09 [adaptor/proxy] shutting down socket forwarder

@ajaypvictor ajaypvictor requested a review from a team as a code owner September 26, 2024 09:33
@stevenhorsman
Member

I think this needs to be draft until #2036, which bumps us to the 3.9.0 version of the kata runtime, is merged?

@ajaypvictor ajaypvictor marked this pull request as draft September 26, 2024 09:47
@ajaypvictor ajaypvictor marked this pull request as ready for review September 30, 2024 01:38
@snir911
Contributor

snir911 commented Sep 30, 2024

I missed the kata-containers PR. Do we expect other providers to use the same annotation for their podvm images? E.g. in AWS, will io.katacontainers.config.hypervisor.image point to an AMI image id?

@ajaypvictor
Contributor Author

I missed the kata-containers PR. Do we expect other providers to use the same annotation for their podvm images? E.g. in AWS, will io.katacontainers.config.hypervisor.image point to an AMI image id?

Hi Snir,
As you correctly pointed out, it's expected that other providers will use this annotation as well.
The discussion about this happened on the kata-dev Slack channel.

@ajaypvictor ajaypvictor force-pushed the multiple-podvm-images branch 2 times, most recently from 077b833 to 4f00cce Compare October 3, 2024 15:37
@ajaypvictor
Contributor Author

e2e test results:

test:~/cloud-api-adaptor/src/cloud-api-adaptor# RUN_TESTS=TestLibvirtCreatePeerPodContainerWithAlternateImage TEST_PROVISION=no TEST_INSTALL_CAA=no make TEST_PROVISION_FILE=/root/cloud-api-adaptor/src/cloud-api-adaptor/libvirt.properties CLOUD_PROVIDER=libvirt test-e2e
go test -v -tags=libvirt -timeout 60m -count=1 -run TestLibvirtCreatePeerPodContainerWithAlternateImage ./test/e2e
time="2024-10-03T15:36:01Z" level=info msg="Do setup"
time="2024-10-03T15:36:01Z" level=info msg="Container runtime: containerd"
time="2024-10-03T15:36:01Z" level=info msg="Creating namespace 'coco-pp-e2e-test-5ef29963'..."
time="2024-10-03T15:36:01Z" level=info msg="Wait for namespace 'coco-pp-e2e-test-5ef29963' be ready..."
time="2024-10-03T15:36:06Z" level=info msg="Wait for default serviceaccount in namespace 'coco-pp-e2e-test-5ef29963'..."
time="2024-10-03T15:36:06Z" level=info msg="default serviceAccount exists, namespace 'coco-pp-e2e-test-5ef29963' is ready for use"
=== RUN   TestLibvirtCreatePeerPodContainerWithAlternateImage
=== RUN   TestLibvirtCreatePeerPodContainerWithAlternateImage/PodVMwithAnnotationsAlternateImage_test
=== RUN   TestLibvirtCreatePeerPodContainerWithAlternateImage/PodVMwithAnnotationsAlternateImage_test/Failed_to_Create_PodVM_with_a_non-existent_image
=== NAME  TestLibvirtCreatePeerPodContainerWithAlternateImage/PodVMwithAnnotationsAlternateImage_test
    assessment_runner.go:617: Deleting pod annotations-instance-type...
    assessment_runner.go:624: Pod annotations-instance-type has been successfully deleted within 60s
--- PASS: TestLibvirtCreatePeerPodContainerWithAlternateImage (10.03s)
    --- PASS: TestLibvirtCreatePeerPodContainerWithAlternateImage/PodVMwithAnnotationsAlternateImage_test (10.03s)
        --- PASS: TestLibvirtCreatePeerPodContainerWithAlternateImage/PodVMwithAnnotationsAlternateImage_test/Failed_to_Create_PodVM_with_a_non-existent_image (0.00s)
PASS
time="2024-10-03T15:36:16Z" level=info msg="Deleting namespace 'coco-pp-e2e-test-5ef29963'..."
time="2024-10-03T15:36:26Z" level=info msg="Namespace 'coco-pp-e2e-test-5ef29963' has been successfully deleted within 60s"
ok  	github.com/confidential-containers/cloud-api-adaptor/src/cloud-api-adaptor/test/e2e	25.107s

@@ -406,6 +406,32 @@ func DoTestPodVMwithAnnotationsLargerCPU(t *testing.T, e env.Environment, assert
NewTestCase(t, e, "PodVMwithAnnotationsLargerCPU", assert, "Failed to Create PodVM with Annotations Larger CPU").WithPod(pod).WithInstanceTypes(testInstanceTypes).WithCustomPodState(v1.PodPending).Run()
}

func DoTestCreatePeerPodContainerWithAlternateImage(t *testing.T, e env.Environment, assert CloudAssert) {
Member

I think the negative test is a good start to check that the code path works. I'm wondering if we could use the provisioner.UploadPodvm logic to upload another podvmImage if a test image is provided (I guess the same image with a different name might be the best we can hope for) and then switch the annotation to try using that. I guess the problem then becomes that checking the alternate podvm is actually being used might be tricky?

Contributor Author

Yeah, that's true. But if checking the CAA logs to ensure the PodVM picked up is the alternate one, as per the annotation, sounds good, then I guess we can try that option?

Member

Yeah, the CAA logs would be a good place to check; I didn't think about that. I'd say it's probably worth spending a bit of time to investigate this approach at least, then.

@ajaypvictor ajaypvictor force-pushed the multiple-podvm-images branch from b18e30a to f451cd9 Compare October 8, 2024 09:26
@ajaypvictor
Copy link
Contributor Author

Latest e2e test results:

~/cloud-api-adaptor/src/cloud-api-adaptor# RUN_TESTS="^TestLibvirtCreatePeerPodContainerWith\(Valid\|Invalid\)AlternateImage$" TEST_PROVISION=no TEST_INSTALL_CAA=no make TEST_PROVISION_FILE=/root/cloud-api-adaptor/src/cloud-api-adaptor/libvirt.properties CLOUD_PROVIDER=libvirt test-e2e
go test -v -tags=libvirt -timeout 60m -count=1 -run ^TestLibvirtCreatePeerPodContainerWith\(Valid\|Invalid\)AlternateImage$ ./test/e2e
time="2024-10-08T09:35:56Z" level=info msg="Do setup"
time="2024-10-08T09:35:56Z" level=info msg="Container runtime: containerd"
time="2024-10-08T09:35:56Z" level=info msg="Creating namespace 'coco-pp-e2e-test-0e4be14f'..."
time="2024-10-08T09:35:56Z" level=info msg="Wait for namespace 'coco-pp-e2e-test-0e4be14f' be ready..."
time="2024-10-08T09:36:01Z" level=info msg="Wait for default serviceaccount in namespace 'coco-pp-e2e-test-0e4be14f'..."
time="2024-10-08T09:36:01Z" level=info msg="default serviceAccount exists, namespace 'coco-pp-e2e-test-0e4be14f' is ready for use"
=== RUN   TestLibvirtCreatePeerPodContainerWithValidAlternateImage
=== RUN   TestLibvirtCreatePeerPodContainerWithValidAlternateImage/PodVMwithAnnotationsValidAlternateImage_test
    assessment_runner.go:270: Waiting for containers in pod: annotations-valid-alternate-image are ready
=== RUN   TestLibvirtCreatePeerPodContainerWithValidAlternateImage/PodVMwithAnnotationsValidAlternateImage_test/PodVM_created_with_an_alternate_image
    assessment_runner.go:569: PodVM was brought up using the alternate PodVM image another-podvm-base.qcow2
=== NAME  TestLibvirtCreatePeerPodContainerWithValidAlternateImage/PodVMwithAnnotationsValidAlternateImage_test
    assessment_runner.go:650: Deleting pod annotations-valid-alternate-image...
    assessment_runner.go:657: Pod annotations-valid-alternate-image has been successfully deleted within 60s
--- PASS: TestLibvirtCreatePeerPodContainerWithValidAlternateImage (95.05s)
    --- PASS: TestLibvirtCreatePeerPodContainerWithValidAlternateImage/PodVMwithAnnotationsValidAlternateImage_test (95.05s)
        --- PASS: TestLibvirtCreatePeerPodContainerWithValidAlternateImage/PodVMwithAnnotationsValidAlternateImage_test/PodVM_created_with_an_alternate_image (0.03s)
=== RUN   TestLibvirtCreatePeerPodContainerWithInvalidAlternateImage
=== RUN   TestLibvirtCreatePeerPodContainerWithInvalidAlternateImage/PodVMwithAnnotationsInvalidAlternateImage_test
=== RUN   TestLibvirtCreatePeerPodContainerWithInvalidAlternateImage/PodVMwithAnnotationsInvalidAlternateImage_test/Failed_to_Create_PodVM_with_a_non-existent_image
=== NAME  TestLibvirtCreatePeerPodContainerWithInvalidAlternateImage/PodVMwithAnnotationsInvalidAlternateImage_test
    assessment_runner.go:650: Deleting pod annotations-invalid-alternate-image...
    assessment_runner.go:657: Pod annotations-invalid-alternate-image has been successfully deleted within 60s
--- PASS: TestLibvirtCreatePeerPodContainerWithInvalidAlternateImage (10.03s)
    --- PASS: TestLibvirtCreatePeerPodContainerWithInvalidAlternateImage/PodVMwithAnnotationsInvalidAlternateImage_test (10.03s)
        --- PASS: TestLibvirtCreatePeerPodContainerWithInvalidAlternateImage/PodVMwithAnnotationsInvalidAlternateImage_test/Failed_to_Create_PodVM_with_a_non-existent_image (0.00s)
PASS
time="2024-10-08T09:37:47Z" level=info msg="Deleting namespace 'coco-pp-e2e-test-0e4be14f'..."
time="2024-10-08T09:37:57Z" level=info msg="Namespace 'coco-pp-e2e-test-0e4be14f' has been successfully deleted within 60s"
ok  	github.com/confidential-containers/cloud-api-adaptor/src/cloud-api-adaptor/test/e2e	120.159s

Member

@stevenhorsman stevenhorsman left a comment

A few refactor suggestions, but otherwise it looks good. Thanks!

Comment on lines 556 to 570
var podlist v1.PodList
var podLogString string
expectedSuccessMessage := "Choosing " + tc.alternateImageName
if err := client.Resources("confidential-containers-system").List(ctx, &podlist); err != nil {
	t.Fatal(err)
}
for _, pod := range podlist.Items {
	if pod.Labels["app"] == "cloud-api-adaptor" {
		podLogString, _ = GetPodLog(ctx, client, pod)
		break
	}
}
if strings.Contains(podLogString, expectedSuccessMessage) {
	t.Logf("PodVM was brought up using the alternate PodVM image %s", tc.alternateImageName)
} else {
	t.Logf("cloud-api-adaptor pod logs: %s", podLogString)
	yamlData, err := yaml.Marshal(tc.pod.Status)
	if err != nil {
		log.Errorf("Error marshaling pod.Status to YAML: %s", err.Error())
	} else {
		t.Logf("Current Pod State: %v", string(yamlData))
	}
	t.Fatal("failed due to unknown reason")
}
Member

Optional request: could we move this section to assessment_helper.go as a new function? I'm trying to reduce how massive assessment_runner has become, using this strategy in some WIP commits I'm working on on the side.

@@ -68,6 +69,9 @@ func NewLibvirtProvisioner(properties map[string]string) (pv.CloudProvisioner, e
vol_name = properties["libvirt_vol_name"]
}

// alternate_vol_name is used for multi-podvm testing.
alternate_vol_name := "another-podvm-base.qcow2"
Member

Is there a possibility to extract this as a constant, so we can reference it in libvirt_test rather than having to copy the value there?

Contributor Author

Actually I tried it that way, but unfortunately common.go inside the e2e folder does not extend to this provision_common.go. I also thought about adding it to common.go under provisioner, which again is not referenced by libvirt_test/common_suite under e2e. If importing it sounds fair, I can try that out.

Member

Ah, I see. One approach we could take (which I'm not sure is even good) is to create the const in libvirt/provision_common.go; then we can reference it in libvirt_test.go and pass it in as an argument to the common_test.go version. Otherwise I think we can leave it for now.
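A minimal sketch of that idea (hypothetical file layout and names): the constant is declared once and the shared test receives it as an argument instead of repeating the literal.

```go
package main

import "fmt"

// AlternateVolumeName would be declared once in the libvirt provisioner
// (e.g. provision_common.go) and imported by libvirt_test.go.
const AlternateVolumeName = "another-podvm-base.qcow2"

// doTestWithAlternateImage stands in for the common_suite test, which
// takes the image name as a parameter rather than hard-coding it.
func doTestWithAlternateImage(imageName string) string {
	return "creating podvm from alternate image " + imageName
}

func main() {
	fmt.Println(doTestWithAlternateImage(AlternateVolumeName))
}
```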

Contributor Author

Thanks Steve, this sounds like a good idea; I will test it out.

@@ -135,6 +136,11 @@ func (tc *TestCase) WithInstanceTypes(testInstanceTypes InstanceValidatorFunctio
return tc
}

func (tc *TestCase) WithMultiplePodVM(alternateImageName string) *TestCase {
Member

I wonder if WithMultiplePodVM is the best name here? Maybe something like WithAlternateImage would be clearer?

@ajaypvictor ajaypvictor force-pushed the multiple-podvm-images branch 2 times, most recently from fde68ba to 5f5ef7e Compare October 8, 2024 17:36
@ajaypvictor
Copy link
Contributor Author

ajaypvictor commented Oct 9, 2024

@stevenhorsman Would it be fair to add test_e2e_libvirt label now, if we are good to test the PR via the pipeline?

Member

@stevenhorsman stevenhorsman left a comment

A couple of refactor things to look into.

@@ -525,3 +525,31 @@ func AddImagePullSecretToDefaultServiceAccount(ctx context.Context, client klien
}
return nil
}

func VerifyCloudAPIAdaptorLogs(ctx context.Context, client klient.Client, t *testing.T, pod *v1.Pod, expectedSuccessMessage string) error {
Member

Sorry, I wasn't super clear with the refactor suggestion: I was thinking of including the "Choosing ..." logic in the helper method as well. I think you've done too good a job and the generic VerifyCloudAPIAdaptorLogs is a great idea, so can you please review other methods to see if it can be used elsewhere, e.g. IsPulledWithNydusSnapshotter (but that one uses ComparePodLogString)?

Comment on lines 533 to 552
if err := client.Resources("confidential-containers-system").List(ctx, &podlist); err != nil {
	return err
}
for _, pod := range podlist.Items {
	if pod.Labels["app"] == "cloud-api-adaptor" {
		podLogString, _ = GetPodLog(ctx, client, pod)
		break
	}
}
if strings.Contains(podLogString, expectedSuccessMessage) {
	t.Logf("cloud-api-adaptor logs are matching with the expected message")
} else {
	yamlData, err := yaml.Marshal(pod.Status)
	if err != nil {
		t.Fatalf("Error marshaling pod.Status to YAML: %s", err.Error())
	} else {
		t.Logf("Current Pod State: %v", string(yamlData))
		t.Fatal("failed due to unknown reason")
	}
	return errors.New("failed due to unknown reason")
Member

Apart from the pod app being hard-coded (which might be enough of an issue to make this not re-usable), this looks pretty similar to ComparePodLogString, so I wonder if you could call ComparePodLogString from within this to reduce duplication?
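The suggested de-duplication could look roughly like this (hypothetical signatures; the real helpers take ctx, client, and testing.T arguments): the CAA-log check fetches the log string and delegates the substring comparison to one shared helper.

```go
package main

import (
	"fmt"
	"strings"
)

// comparePodLogString stands in for the existing shared helper that
// checks whether a pod log contains an expected string.
func comparePodLogString(podLog, expected string) error {
	if !strings.Contains(podLog, expected) {
		return fmt.Errorf("expected %q in pod log", expected)
	}
	return nil
}

// verifyCAALogs reuses comparePodLogString instead of repeating the
// strings.Contains logic on the cloud-api-adaptor log.
func verifyCAALogs(caaLog, expectedSuccessMessage string) error {
	return comparePodLogString(caaLog, expectedSuccessMessage)
}

func main() {
	log := "Choosing another-podvm-base.qcow2 as libvirt volume for the PodVM image"
	fmt.Println(verifyCAALogs(log, "Choosing another-podvm-base.qcow2")) // nil error: message found
}
```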

@stevenhorsman stevenhorsman added the test_e2e_libvirt Run Libvirt e2e tests label Oct 9, 2024
@ajaypvictor ajaypvictor force-pushed the multiple-podvm-images branch from 5f5ef7e to 5634221 Compare October 9, 2024 13:13
@@ -203,11 +203,15 @@ func (s *cloudService) CreateVM(ctx context.Context, req *pb.CreateVMRequest) (r
// Get Pod VM cpu and memory from annotations
vcpus, memory := util.GetCPUAndMemoryFromAnnotation(req.Annotations)

// Get Pod VM image from annotations
image := util.GetImageFromAnnotation(req.Annotations)
Member

Can the generic changes be in a separate commit to distinguish from provider specific implementation ?

Member

@stevenhorsman stevenhorsman left a comment

The code looks good, we just need to wait until the libvirt e2e tests are back and confirm the tests work.

Member

@bpradipt bpradipt left a comment

/lgtm
I have no means to test this.

@stevenhorsman
Member

@ajaypvictor can you rebase this PR please? Hopefully the tests will then run properly. Thanks!

@ajaypvictor ajaypvictor force-pushed the multiple-podvm-images branch 2 times, most recently from bc5cf2a to c134ef1 Compare October 12, 2024 03:13
@stevenhorsman
Member

It looks like in the e2e testing TestLibvirtCreatePeerPodWithJob is failing and having knock-on impact to the rest of the tests. I'll try and look into this manually if I get a chance...

@stevenhorsman
Member

The job test is passing locally for me:

=== RUN   TestLibvirtCreatePeerPodWithJob
=== RUN   TestLibvirtCreatePeerPodWithJob/JobPeerPod_test
=== RUN   TestLibvirtCreatePeerPodWithJob/JobPeerPod_test/Job_has_been_created
    assessment_helpers.go:292: SUCCESS: job-pi-4t6v7 - Completed - LOG: 3.14156
    assessment_runner.go:342: Output Log from Pod: 3.14156
=== NAME  TestLibvirtCreatePeerPodWithJob/JobPeerPod_test
    assessment_runner.go:625: Deleting Job... job-pi
    assessment_runner.go:632: Deleting pods created by job... job-pi-4t6v7
--- PASS: TestLibvirtCreatePeerPodWithJob (100.09s)
    --- PASS: TestLibvirtCreatePeerPodWithJob/JobPeerPod_test (100.09s)
        --- PASS: TestLibvirtCreatePeerPodWithJob/JobPeerPod_test/Job_has_been_created (0.02s)
PASS

@stevenhorsman
Member

stevenhorsman commented Oct 14, 2024

I've just spotted that the CI e2e test show failure logs for the deletion-test of:

Warning  FailedCreatePodSandBox  3m52s (x26 over 9m43s)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: remote hypervisor call failed: rpc error: code = Unknown desc = creating an instance : Error in creating volume: Can't retrieve volume non-existing-image: unknown

which doesn't seem right...

I've confirmed locally that running TestLibvirtCreatePeerPodWithJob after the new tests hits this error, so it's a bug with the code/test, not a CI flake.

@stevenhorsman
Member

stevenhorsman commented Oct 14, 2024

@ajaypvictor - I've been able to reproduce this issue by doing export RUN_TESTS="\"(TestLibvirtCreatePeerPodContainerWithInvalidAlternateImage|TestLibvirtCreatePeerPodWithJob)\"". Describing the job shows that it doesn't have annotations, but is trying to create a podvm with volume non-existing-image, so I guess there might be some error in the code that is caching the annotation setting, or something?

Here is the caa log for the relevant section:

2024/10/14 16:53:05 [adaptor/cloud/libvirt] Choosing non-existing-image as libvirt volume for the PodVM image
2024/10/14 16:53:05 [adaptor/cloud/libvirt] Checking if instance (podvm-annotations-invalid-alternate-image-cd0a9e67) exists
2024/10/14 16:53:05 [adaptor/cloud/libvirt] failed to create an instance : Error in creating volume: Can't retrieve volume non-existing-image
2024/10/14 16:53:05 [adaptor/cloud] error starting instance: creating an instance : Error in creating volume: Can't retrieve volume non-existing-image
2024/10/14 16:53:05 [adaptor/proxy] shutting down socket forwarder
2024/10/14 16:53:05 [adaptor/cloud/libvirt] Deleting instance ()
2024/10/14 16:53:05 [adaptor/cloud/libvirt] Checking if instance (0) exists
2024/10/14 16:53:05 [adaptor/cloud/libvirt] Instance () not found
2024/10/14 16:53:05 [adaptor/cloud/libvirt] deleted an instance
2024/10/14 16:53:05 [adaptor/cloud] failed to release PeerPod pod to PeerPod mapping not found
2024/10/14 16:53:05 [tunneler/vxlan] Delete tc redirect filters on eth0 and ens3 in the network namespace /var/run/netns/cni-eeb8eed7-4b50-29a3-82dd-f918e031a4d7
2024/10/14 16:53:05 [adaptor/cloud] tearing down netns /var/run/netns/cni-eeb8eed7-4b50-29a3-82dd-f918e031a4d7: failed to tear down tunnel "vxlan": failed to delete a tc redirect filter from vxlan1 to eth0: failed to get interface vxlan1: Link not found
2024/10/14 16:53:11 [podnetwork] routes on netns /var/run/netns/cni-7ec115cb-fcd4-7c1a-7ef0-c654691dea1a
2024/10/14 16:53:11 [podnetwork]     0.0.0.0/0 via 10.244.1.1 dev eth0
2024/10/14 16:53:11 [podnetwork]     10.244.0.0/16 via 10.244.1.1 dev eth0
2024/10/14 16:53:11 [podnetwork]     10.244.1.0/24  dev eth0
2024/10/14 16:53:11 [adaptor/cloud] credential file /root/containers/auth.json is not present, skipping image auth config
2024/10/14 16:53:11 [adaptor/cloud] stored /run/peerpod/pods/b6720a96939fa36c57aa28f7f383190bc66f2be9129bca5dc4c92262c618e1da/daemon.json
2024/10/14 16:53:11 [adaptor/cloud] initdata in Pod annotation:
2024/10/14 16:53:11 [adaptor/cloud] initdata in pod annotation is empty, use global initdata:
2024/10/14 16:53:11 [adaptor/cloud] create a sandbox b6720a96939fa36c57aa28f7f383190bc66f2be9129bca5dc4c92262c618e1da for pod job-pi-lcdkc in namespace coco-pp-e2e-test-85277355 (netns: /var/run/netns/cni-7ec115cb-fcd4-7c1a-7ef0-c654691dea1a)
2024/10/14 16:53:11 [adaptor/cloud/libvirt] LaunchSecurityType: None
2024/10/14 16:53:11 [adaptor/cloud/libvirt] Checking if instance (podvm-job-pi-lcdkc-b6720a96) exists
2024/10/14 16:53:11 [adaptor/cloud/libvirt] failed to create an instance : Error in creating volume: Can't retrieve volume non-existing-image
2024/10/14 16:53:11 [adaptor/cloud] error starting instance: creating an instance : Error in creating volume: Can't retrieve volume non-existing-image

Hopefully this is enough info to give you a path to debug 🤞 but we can catch-up tomorrow

@ajaypvictor
Contributor Author

[quoting the reproduction report and CAA log from the previous comment]

Thanks Steve for figuring this out, and apologies for the miss on my side.

@ajaypvictor ajaypvictor force-pushed the multiple-podvm-images branch from c134ef1 to 4cebdcc Compare October 14, 2024 18:56
Enable support for multiple PodVM images for cloud providers with the merge of kata-containers/kata-containers#10274

Fixes confidential-containers#1282

Signed-off-by: Ajay Victor <ajvictor@in.ibm.com>
Enable support for multiple PodVM images for libvirt provider

Fixes confidential-containers#1282

Signed-off-by: Ajay Victor <ajvictor@in.ibm.com>
Tests for multiple PodVM images with an invalid, non-existing volume and a copy of the valid volume under a different name

Fixes confidential-containers#1282

Signed-off-by: Ajay Victor <ajvictor@in.ibm.com>
@ajaypvictor ajaypvictor force-pushed the multiple-podvm-images branch from 4cebdcc to 589382d Compare October 14, 2024 18:57
@ajaypvictor
Contributor Author

I think the edge case was that the CAA would continue to use the annotation set previously, rather than switching back to the default for the next podvm; I have fixed this case.
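The kind of bug described can be sketched as follows (hypothetical, not the actual CAA code): storing the annotation-derived image on shared provider state lets a stale value leak into the next pod, whereas deriving it per request does not.

```go
package main

import "fmt"

const imageAnnotation = "io.katacontainers.config.hypervisor.image"

type provider struct {
	defaultImage string
	image        string // shared mutable state: the source of the stale-value bug
}

// buggyImageFor mutates shared state, so an earlier pod's annotation
// leaks into later pods that set no annotation.
func (p *provider) buggyImageFor(anns map[string]string) string {
	if v := anns[imageAnnotation]; v != "" {
		p.image = v
	}
	if p.image == "" {
		p.image = p.defaultImage
	}
	return p.image
}

// fixedImageFor derives the image per request and never mutates shared state.
func (p *provider) fixedImageFor(anns map[string]string) string {
	if v := anns[imageAnnotation]; v != "" {
		return v
	}
	return p.defaultImage
}

func main() {
	p := &provider{defaultImage: "podvm-base.qcow2"}
	fmt.Println(p.buggyImageFor(map[string]string{imageAnnotation: "non-existing-image"})) // annotation applied
	fmt.Println(p.buggyImageFor(nil))  // stale "non-existing-image" leaks into the next pod
	fmt.Println(p.fixedImageFor(nil)) // per-request derivation falls back to the default
}
```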

test:~/cloud-api-adaptor/src/cloud-api-adaptor# RUN_TESTS="\"(TestLibvirtCreatePeerPodContainerWithInvalidAlternateImage|TestLibvirtCreatePeerPodWithJob)\"" TEST_PROVISION=no TEST_INSTALL_CAA=no make TEST_PROVISION_FILE=/root/cloud-api-adaptor/src/cloud-api-adaptor/libvirt.properties CLOUD_PROVIDER=libvirt test-e2e
go test -v -tags=libvirt -timeout 60m -count=1 -run "(TestLibvirtCreatePeerPodContainerWithInvalidAlternateImage|TestLibvirtCreatePeerPodWithJob)" ./test/e2e
time="2024-10-14T18:54:25Z" level=info msg="Do setup"
time="2024-10-14T18:54:25Z" level=info msg="Container runtime: containerd"
time="2024-10-14T18:54:25Z" level=info msg="Creating namespace 'coco-pp-e2e-test-19cb05f1'..."
time="2024-10-14T18:54:25Z" level=info msg="Wait for namespace 'coco-pp-e2e-test-19cb05f1' be ready..."
time="2024-10-14T18:54:30Z" level=info msg="Wait for default serviceaccount in namespace 'coco-pp-e2e-test-19cb05f1'..."
time="2024-10-14T18:54:30Z" level=info msg="default serviceAccount exists, namespace 'coco-pp-e2e-test-19cb05f1' is ready for use"
=== RUN   TestLibvirtCreatePeerPodContainerWithInvalidAlternateImage
=== RUN   TestLibvirtCreatePeerPodContainerWithInvalidAlternateImage/PodVMwithAnnotationsInvalidAlternateImage_test
=== RUN   TestLibvirtCreatePeerPodContainerWithInvalidAlternateImage/PodVMwithAnnotationsInvalidAlternateImage_test/Failed_to_Create_PodVM_with_a_non-existent_image
=== NAME  TestLibvirtCreatePeerPodContainerWithInvalidAlternateImage/PodVMwithAnnotationsInvalidAlternateImage_test
    assessment_runner.go:642: Deleting pod annotations-invalid-alternate-image...
    assessment_runner.go:649: Pod annotations-invalid-alternate-image has been successfully deleted within 60s
--- PASS: TestLibvirtCreatePeerPodContainerWithInvalidAlternateImage (10.02s)
    --- PASS: TestLibvirtCreatePeerPodContainerWithInvalidAlternateImage/PodVMwithAnnotationsInvalidAlternateImage_test (10.02s)
        --- PASS: TestLibvirtCreatePeerPodContainerWithInvalidAlternateImage/PodVMwithAnnotationsInvalidAlternateImage_test/Failed_to_Create_PodVM_with_a_non-existent_image (0.00s)
=== RUN   TestLibvirtCreatePeerPodWithJob
=== RUN   TestLibvirtCreatePeerPodWithJob/JobPeerPod_test
=== RUN   TestLibvirtCreatePeerPodWithJob/JobPeerPod_test/Job_has_been_created
    assessment_helpers.go:292: SUCCESS: job-pi-5nw8k - Completed - LOG: 3.14156
    assessment_runner.go:342: Output Log from Pod: 3.14156
=== NAME  TestLibvirtCreatePeerPodWithJob/JobPeerPod_test
    assessment_runner.go:625: Deleting Job... job-pi
    assessment_runner.go:632: Deleting pods created by job... job-pi-5nw8k
--- PASS: TestLibvirtCreatePeerPodWithJob (90.04s)
    --- PASS: TestLibvirtCreatePeerPodWithJob/JobPeerPod_test (90.03s)
        --- PASS: TestLibvirtCreatePeerPodWithJob/JobPeerPod_test/Job_has_been_created (0.01s)
PASS
time="2024-10-14T18:56:10Z" level=info msg="Deleting namespace 'coco-pp-e2e-test-19cb05f1'..."
time="2024-10-14T18:56:20Z" level=info msg="Namespace 'coco-pp-e2e-test-19cb05f1' has been successfully deleted within 60s"
ok  	github.com/confidential-containers/cloud-api-adaptor/src/cloud-api-adaptor/test/e2e	115.135s
root@ajay-trustee:~/cloud-api-adaptor/src/cloud-api-adaptor#

@stevenhorsman stevenhorsman merged commit 8f258c7 into confidential-containers:main Oct 15, 2024
28 checks passed
@ajaypvictor ajaypvictor deleted the multiple-podvm-images branch October 15, 2024 08:53
Successfully merging this pull request may close these issues.

Allow modifying the VM image used for the podvm instance
4 participants